As someone considering a move into the field of Data Science and Machine Learning, I have looked through online job ads for data scientists and was overwhelmed by the number of skills required. It would take a very long time to learn the whole repertoire of data science tools.
To help decide which tools to learn first, I decided to look at which data science skills are currently most in demand among employers in major cities in Canada and the U.S.
Just like any sane person would do these days, I googled “web scraping indeed” to see if I could scrape job ads and analyse them. I came across this awesome blog post by Jesse Steinweg-Woods, published in March 2015. He looked at the data science tools mentioned in the job ads aggregated by indeed.com for Seattle, Chicago, San Francisco and New York, as well as nationwide.
His findings from the nationwide search suggested that Python was much more in demand than R. This really intrigued me, and I wanted to see whether the trends he found a year ago had changed.
My initial goals were
However, as I searched for job ads for data scientists in Canada, I came to realize that there weren’t enough ads in most of the cities. So I decided to focus on the national trend for Canada instead.
As for the U.S., I initially wanted to look at the top 100 largest cities. During data collection, however, some of the cities had too few job ads, or none at all. So I decided to focus instead on the following 53 cities, which had enough job ads for data scientists.
During my project meeting with David Robinson, who guided me through this project, he suggested that it would also be interesting to compare the trends between data scientists and data analysts. So I decided to look at the skills that employers are seeking for other data-driven jobs as well (specifically Data Analysts, Data Architects and Data Engineers).
I used indeed.com and indeed.ca to gather the job ads data. I also received job ads data from Stack Overflow, thanks to David, who works there as a data scientist! Due to the confidentiality agreement, I cannot share the Stack Overflow data. However, you can find all the data that I scraped off Indeed in my GitHub repository.
I wanted to use rvest to scrape data from Indeed. However, after spending a few days of trying, I couldn't get the scraping function to work: Indeed is a job search aggregator, so the links it provides go to different company websites, each with a different structure. This was my first time scraping, so I probably didn't know enough R tools to make it work. You can see a few of my failed attempts in my GitHub repo.
So I decided to see if I could modify Jesse's IPython code to produce .csv files of the word counts (names of the data science tools) from the job ads. His code was written to plot the results right after scraping, not to produce .csv files. I contacted Jesse for permission to use his code for this project, and he kindly agreed and gave me some tips.
You can find the IPython notebooks that I used to scrape Indeed here.
You may notice that the code in the notebooks is a bit repetitive. I was trying to stick to the DRY (don't repeat yourself) principle, but I had never programmed in Python before, so making the code neater will be a work in progress for me. In the meantime, the code was enough to collect the data.
Data was collected for job ads that contained exactly “Data Analyst”, “Data Scientist”, “Data Architect” or “Data Engineer” in their text. Since the search was done for each job title in each of the 53 cities mentioned above, I ended up with 212 (53 × 4) .csv files. I had to do quite a bit of wrangling to combine them.
Out of the 53 cities, 14 had fewer than 10 job postings for data engineers, or none at all. Therefore, I used 39 cities to compare the skills among Data Analysts, Data Scientists, Data Architects and Data Engineers.
The data collected include the percentage of job ads that contain the names of the analysis tools or programming languages specified below, and the number of job postings for each job title in each city.
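Sticking to the DRY idea, the 212 per-city, per-title files could in principle come from one doubly nested loop over cities and job titles. Here is a minimal Python sketch; the commented-out `scrape_skill_counts` call is a hypothetical stand-in for the actual scraping routine in the notebooks, and the counts shown are placeholder data:

```python
import csv

# Job titles searched on Indeed; file index i corresponds to TITLES[i - 1].
TITLES = ["Data Analyst", "Data Scientist", "Data Architect", "Data Engineer"]

def output_name(city, title_index):
    """Build the '<City>Skills<N>.csv' name consumed by the wrangling step."""
    return f"{city}Skills{title_index}.csv"

def save_counts(counts, path):
    """Write Term -> NumPostings rows to a two-column CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Term", "NumPostings"])
        writer.writerows(counts.items())

for city in ["NewYork", "Chicago"]:  # the full run would cover all 53 cities
    for i, title in enumerate(TITLES, start=1):
        # counts = scrape_skill_counts(city, title)  # hypothetical scraper call
        counts = {"Python": 50.0, "R": 40.0}         # placeholder data
        save_counts(counts, output_name(city, i))
```

Each generated file then matches the `<City>Skills<N>.csv` inputs read by the R wrangling code below.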
library(readr)
library(tidyr)
library(dplyr)
For each city I have separate .csv files, one per job title (Data Analysts, Data Scientists, Data Architects and Data Engineers), each containing the columns Term (name of the data science tool) and NumPostings (percentage of job ads that contain the term). Below, I combine them to analyse the data.
After finishing the lengthy process below, I realized that I could have written a function to do it; at the time, however, I couldn't figure out how. Since I had already written the .csv files for all the cities, I'm simply showing the work I did below. Feel free to skip ahead to “Specific Research Questions”.
You can find the data used for the following wrangling here.
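In hindsight, the whole per-city routine could be collapsed into one helper. Here is a minimal sketch in R, assuming the `<City>Skills<1-4>.csv` naming convention used below (untested against the actual files):

```r
library(readr)
library(dplyr)
library(tidyr)

# Combine the per-role CSVs for one city into a single data frame.
# Assumes file i holds the counts for roles[i]; cities without
# data-engineer postings could pass the first three roles only.
combine_city <- function(city, state,
                         roles = c("DAnalyst", "DScientist",
                                   "DArchitect", "DEngineer")) {
  dfs <- lapply(seq_along(roles), function(i) {
    df <- read_csv(paste0(city, "Skills", i, ".csv"))
    names(df) <- c("Term", roles[i])
    df
  })
  combined <- Reduce(function(x, y) full_join(x, y, by = "Term"), dfs)
  combined$City <- city
  combined$State <- state
  # NAs mean the term never appeared for that role, so replace with zero
  fill <- setNames(as.list(rep(0, length(roles))), roles)
  combined <- replace_na(combined, fill)
  write_csv(combined, paste0(city, ".csv"))
  combined
}

# e.g. NewYork <- combine_city("NewYork", "NY")
```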
Joining datasets:
#NewYork
NewYork1 <- read_csv("NewYorkSkills1.csv") #.csv for Data Analysts' job ad data
NewYork2 <- read_csv("NewYorkSkills2.csv") #.csv for Data Scientists' job ad data
NewYork3 <- read_csv("NewYorkSkills3.csv") #.csv for Data Architects' job ad data
NewYork4 <- read_csv("NewYorkSkills4.csv") #.csv for Data Engineers' job ad data
names(NewYork1) <- c("Term", "DAnalyst")
names(NewYork2) <- c("Term", "DScientist")
names(NewYork3) <- c("Term", "DArchitect")
names(NewYork4) <- c("Term", "DEngineer")
j1 <- full_join(NewYork1, NewYork2, by="Term")
j2 <- full_join(NewYork3, NewYork4, by="Term")
NewYork <- full_join(j1, j2, by="Term")
NewYork$City <- "New York"
NewYork$State <- "NY"
NewYork <- replace_na(NewYork, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
#NAs mean the term didn't appear in the job postings for that role, so they are replaced with zero.
write_csv(NewYork, path = "NewYork.csv")
#Anaheim
Anaheim1 <- read_csv("AnaheimSkills1.csv")
Anaheim2 <- read_csv("AnaheimSkills2.csv")
Anaheim3 <- read_csv("AnaheimSkills3.csv")
Anaheim4 <- read_csv("AnaheimSkills4.csv")
names(Anaheim1) <- c("Term", "DAnalyst")
names(Anaheim2) <- c("Term", "DScientist")
names(Anaheim3) <- c("Term", "DArchitect")
names(Anaheim4) <- c("Term", "DEngineer")
j1 <- full_join(Anaheim1, Anaheim2, by="Term")
j2 <- full_join(Anaheim3, Anaheim4, by="Term")
Anaheim <- full_join(j1, j2, by="Term")
Anaheim$City <- "Anaheim"
Anaheim$State <- "CA"
Anaheim <- replace_na(Anaheim, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Anaheim, path = "Anaheim.csv")
#Arlington
Arlington1 <- read_csv("ArlingtonSkills1.csv")
Arlington2 <- read_csv("ArlingtonSkills2.csv")
Arlington3 <- read_csv("ArlingtonSkills3.csv")
Arlington4 <- read_csv("ArlingtonSkills4.csv")
names(Arlington1) <- c("Term", "DAnalyst")
names(Arlington2) <- c("Term", "DScientist")
names(Arlington3) <- c("Term", "DArchitect")
names(Arlington4) <- c("Term", "DEngineer")
j1 <- full_join(Arlington1, Arlington2, by="Term")
j2 <- full_join(Arlington3, Arlington4, by="Term")
Arlington <- full_join(j1, j2, by="Term")
Arlington$City <- "Arlington"
Arlington$State <- "TX"
Arlington <- replace_na(Arlington, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Arlington, path = "Arlington.csv")
#Atlanta
Atlanta1 <- read_csv("AtlantaSkills1.csv")
Atlanta2 <- read_csv("AtlantaSkills2.csv")
Atlanta3 <- read_csv("AtlantaSkills3.csv")
Atlanta4 <- read_csv("AtlantaSkills4.csv")
names(Atlanta1) <- c("Term", "DAnalyst")
names(Atlanta2) <- c("Term", "DScientist")
names(Atlanta3) <- c("Term", "DArchitect")
names(Atlanta4) <- c("Term", "DEngineer")
j1 <- full_join(Atlanta1, Atlanta2, by="Term")
j2 <- full_join(Atlanta3, Atlanta4, by="Term")
Atlanta <- full_join(j1, j2, by="Term")
Atlanta$City <- "Atlanta"
Atlanta$State <- "GA"
Atlanta <- replace_na(Atlanta, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Atlanta, path = "Atlanta.csv")
#Aurora
Aurora1 <- read_csv("AuroraSkills1.csv")
Aurora2 <- read_csv("AuroraSkills2.csv")
Aurora3 <- read_csv("AuroraSkills3.csv")
Aurora4 <- read_csv("AuroraSkills4.csv")
names(Aurora1) <- c("Term", "DAnalyst")
names(Aurora2) <- c("Term", "DScientist")
names(Aurora3) <- c("Term", "DArchitect")
names(Aurora4) <- c("Term", "DEngineer")
j1 <- full_join(Aurora1, Aurora2, by="Term")
j2 <- full_join(Aurora3, Aurora4, by="Term")
Aurora <- full_join(j1, j2, by="Term")
Aurora$City <- "Aurora"
Aurora$State <- "CO"
Aurora <- replace_na(Aurora, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Aurora, path = "Aurora.csv")
#Austin
Austin1 <- read_csv("AustinSkills1.csv")
Austin2 <- read_csv("AustinSkills2.csv")
Austin3 <- read_csv("AustinSkills3.csv")
Austin4 <- read_csv("AustinSkills4.csv")
names(Austin1) <- c("Term", "DAnalyst")
names(Austin2) <- c("Term", "DScientist")
names(Austin3) <- c("Term", "DArchitect")
names(Austin4) <- c("Term", "DEngineer")
j1 <- full_join(Austin1, Austin2, by="Term")
j2 <- full_join(Austin3, Austin4, by="Term")
Austin <- full_join(j1, j2, by="Term")
Austin$City <- "Austin"
Austin$State <- "TX"
Austin <- replace_na(Austin, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Austin, path = "Austin.csv")
#Baltimore
Baltimore1 <- read_csv("BaltimoreSkills1.csv")
Baltimore2 <- read_csv("BaltimoreSkills2.csv")
Baltimore3 <- read_csv("BaltimoreSkills3.csv")
Baltimore4 <- read_csv("BaltimoreSkills4.csv")
names(Baltimore1) <- c("Term", "DAnalyst")
names(Baltimore2) <- c("Term", "DScientist")
names(Baltimore3) <- c("Term", "DArchitect")
names(Baltimore4) <- c("Term", "DEngineer")
j1 <- full_join(Baltimore1, Baltimore2, by="Term")
j2 <- full_join(Baltimore3, Baltimore4, by="Term")
Baltimore <- full_join(j1, j2, by="Term")
Baltimore$City <- "Baltimore"
Baltimore$State <- "MD"
Baltimore <- replace_na(Baltimore, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Baltimore, path = "Baltimore.csv")
#Boston
Boston1 <- read_csv("BostonSkills1.csv")
Boston2 <- read_csv("BostonSkills2.csv")
Boston3 <- read_csv("BostonSkills3.csv")
Boston4 <- read_csv("BostonSkills4.csv")
names(Boston1) <- c("Term", "DAnalyst")
names(Boston2) <- c("Term", "DScientist")
names(Boston3) <- c("Term", "DArchitect")
names(Boston4) <- c("Term", "DEngineer")
j1 <- full_join(Boston1, Boston2, by="Term")
j2 <- full_join(Boston3, Boston4, by="Term")
Boston <- full_join(j1, j2, by="Term")
Boston$City <- "Boston"
Boston$State <- "MA"
Boston <- replace_na(Boston, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Boston, path = "Boston.csv")
#Boulder
Boulder1 <- read_csv("BoulderSkills1.csv")
Boulder2 <- read_csv("BoulderSkills2.csv")
Boulder3 <- read_csv("BoulderSkills3.csv")
Boulder4 <- read_csv("BoulderSkills4.csv")
names(Boulder1) <- c("Term", "DAnalyst")
names(Boulder2) <- c("Term", "DScientist")
names(Boulder3) <- c("Term", "DArchitect")
names(Boulder4) <- c("Term", "DEngineer")
j1 <- full_join(Boulder1, Boulder2, by="Term")
j2 <- full_join(Boulder3, Boulder4, by="Term")
Boulder <- full_join(j1, j2, by="Term")
Boulder$City <- "Boulder"
Boulder$State <- "CO"
Boulder <- replace_na(Boulder, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Boulder, path = "Boulder.csv")
#Chandler
a <- read_csv("ChandlerSkills1.csv")
b <- read_csv("ChandlerSkills2.csv")
c <- read_csv("ChandlerSkills3.csv")
d <- read_csv("ChandlerSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Chandler <- full_join(j1, j2, by="Term")
Chandler$City <- "Chandler"
Chandler$State <- "AZ"
Chandler <- replace_na(Chandler, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Chandler, path = "Chandler.csv")
#Charlotte
a <- read_csv("CharlotteSkills1.csv")
b <- read_csv("CharlotteSkills2.csv")
c <- read_csv("CharlotteSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Charlotte <- full_join(j1, c, by="Term")
Charlotte$City <- "Charlotte"
Charlotte$State <- "NC"
Charlotte <- replace_na(Charlotte, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Charlotte, path = "Charlotte.csv")
#Chicago
Chicago1 <- read_csv("ChicagoSkills1.csv")
Chicago2 <- read_csv("ChicagoSkills2.csv")
Chicago3 <- read_csv("ChicagoSkills3.csv")
Chicago4 <- read_csv("ChicagoSkills4.csv")
names(Chicago1) <- c("Term", "DAnalyst")
names(Chicago2) <- c("Term", "DScientist")
names(Chicago3) <- c("Term", "DArchitect")
names(Chicago4) <- c("Term", "DEngineer")
j1 <- full_join(Chicago1, Chicago2, by="Term")
j2 <- full_join(Chicago3, Chicago4, by="Term")
Chicago <- full_join(j1, j2, by="Term")
Chicago$City <- "Chicago"
Chicago$State <- "IL"
Chicago <- replace_na(Chicago, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Chicago, path = "Chicago.csv")
#Cincinnati
a <- read_csv("CincinnatiSkills1.csv")
b <- read_csv("CincinnatiSkills2.csv")
c <- read_csv("CincinnatiSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Cincinnati <- full_join(j1, c, by="Term")
Cincinnati$City <- "Cincinnati"
Cincinnati$State <- "OH"
Cincinnati <- replace_na(Cincinnati, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Cincinnati, path = "Cincinnati.csv")
#Columbus
a <- read_csv("ColumbusSkills1.csv")
b <- read_csv("ColumbusSkills2.csv")
c <- read_csv("ColumbusSkills3.csv")
d <- read_csv("ColumbusSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Columbus <- full_join(j1, j2, by="Term")
Columbus$City <- "Columbus"
Columbus$State <- "OH"
Columbus <- replace_na(Columbus, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Columbus, path = "Columbus.csv")
#Dallas
a <- read_csv("DallasSkills1.csv")
b <- read_csv("DallasSkills2.csv")
c <- read_csv("DallasSkills3.csv")
d <- read_csv("DallasSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Dallas <- full_join(j1, j2, by="Term")
Dallas$City <- "Dallas"
Dallas$State <- "TX"
Dallas <- replace_na(Dallas, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Dallas, path = "Dallas.csv")
#Denver
a <- read_csv("DenverSkills1.csv")
b <- read_csv("DenverSkills2.csv")
c <- read_csv("DenverSkills3.csv")
d <- read_csv("DenverSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Denver <- full_join(j1, j2, by="Term")
Denver$City <- "Denver"
Denver$State <- "CO"
Denver <- replace_na(Denver, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Denver, path = "Denver.csv")
#Detroit
a <- read_csv("DetroitSkills1.csv")
b <- read_csv("DetroitSkills2.csv")
c <- read_csv("DetroitSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Detroit <- full_join(j1, c, by="Term")
Detroit$City <- "Detroit"
Detroit$State <- "MI"
Detroit <- replace_na(Detroit, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Detroit, path = "Detroit.csv")
#Durham
a <- read_csv("DurhamSkills1.csv")
b <- read_csv("DurhamSkills2.csv")
c <- read_csv("DurhamSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Durham <- full_join(j1, c, by="Term")
Durham$City <- "Durham"
Durham$State <- "NC"
Durham <- replace_na(Durham, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Durham, path = "Durham.csv")
#Fremont
a <- read_csv("FremontSkills1.csv")
b <- read_csv("FremontSkills2.csv")
c <- read_csv("FremontSkills3.csv")
d <- read_csv("FremontSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Fremont <- full_join(j1, j2, by="Term")
Fremont$City <- "Fremont"
Fremont$State <- "CA"
Fremont <- replace_na(Fremont, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Fremont, path = "Fremont.csv")
#FortWorth
a <- read_csv("FortWorthSkills1.csv")
b <- read_csv("FortWorthSkills2.csv")
c <- read_csv("FortWorthSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
FortWorth <- full_join(j1, c, by="Term")
FortWorth$City <- "FortWorth"
FortWorth$State <- "TX"
FortWorth <- replace_na(FortWorth, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(FortWorth, path = "FortWorth.csv")
#Garland
a <- read_csv("GarlandSkills1.csv")
b <- read_csv("GarlandSkills2.csv")
c <- read_csv("GarlandSkills3.csv")
d <- read_csv("GarlandSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Garland <- full_join(j1, j2, by="Term")
Garland$City <- "Garland"
Garland$State <- "TX"
Garland <- replace_na(Garland, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Garland, path = "Garland.csv")
#Gilbert
a <- read_csv("GilbertSkills1.csv")
b <- read_csv("GilbertSkills2.csv")
c <- read_csv("GilbertSkills3.csv")
d <- read_csv("GilbertSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Gilbert <- full_join(j1, j2, by="Term")
Gilbert$City <- "Gilbert"
Gilbert$State <- "AZ"
Gilbert <- replace_na(Gilbert, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Gilbert, path = "Gilbert.csv")
#Glendale
a <- read_csv("GlendaleSkills1.csv")
b <- read_csv("GlendaleSkills2.csv")
c <- read_csv("GlendaleSkills3.csv")
d <- read_csv("GlendaleSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Glendale <- full_join(j1, j2, by="Term")
Glendale$City <- "Glendale"
Glendale$State <- "AZ"
Glendale <- replace_na(Glendale, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Glendale, path = "Glendale.csv")
#Hialeah
a <- read_csv("HialeahSkills1.csv")
b <- read_csv("HialeahSkills2.csv")
c <- read_csv("HialeahSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Hialeah <- full_join(j1, c, by="Term")
Hialeah$City <- "Hialeah"
Hialeah$State <- "FL"
Hialeah <- replace_na(Hialeah, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Hialeah, path = "Hialeah.csv")
#Houston
a <- read_csv("HoustonSkills1.csv")
b <- read_csv("HoustonSkills2.csv")
c <- read_csv("HoustonSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Houston <- full_join(j1, c, by="Term")
Houston$City <- "Houston"
Houston$State <- "TX"
Houston <- replace_na(Houston, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Houston, path = "Houston.csv")
#Irvine
a <- read_csv("IrvineSkills1.csv")
b <- read_csv("IrvineSkills2.csv")
c <- read_csv("IrvineSkills3.csv")
d <- read_csv("IrvineSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Irvine <- full_join(j1, j2, by="Term")
Irvine$City <- "Irvine"
Irvine$State <- "CA"
Irvine <- replace_na(Irvine, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Irvine, path = "Irvine.csv")
#Irving
a <- read_csv("IrvingSkills1.csv")
b <- read_csv("IrvingSkills2.csv")
c <- read_csv("IrvingSkills3.csv")
d <- read_csv("IrvingSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Irving <- full_join(j1, j2, by="Term")
Irving$City <- "Irving"
Irving$State <- "TX"
Irving <- replace_na(Irving, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Irving, path = "Irving.csv")
#Jersey
a <- read_csv("JerseySkills1.csv")
b <- read_csv("JerseySkills2.csv")
c <- read_csv("JerseySkills3.csv")
d <- read_csv("JerseySkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Jersey <- full_join(j1, j2, by="Term")
Jersey$City <- "Jersey"
Jersey$State <- "NJ"
Jersey <- replace_na(Jersey, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Jersey, path = "Jersey.csv")
#LongBeach
a <- read_csv("LongBeachSkills1.csv")
b <- read_csv("LongBeachSkills2.csv")
c <- read_csv("LongBeachSkills3.csv")
d <- read_csv("LongBeachSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
LongBeach <- full_join(j1, j2, by="Term")
LongBeach$City <- "LongBeach"
LongBeach$State <- "CA"
LongBeach <- replace_na(LongBeach, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(LongBeach, path = "LongBeach.csv")
#LosAngeles
a <- read_csv("LosAngelesSkills1.csv")
b <- read_csv("LosAngelesSkills2.csv")
c <- read_csv("LosAngelesSkills3.csv")
d <- read_csv("LosAngelesSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
LosAngeles <- full_join(j1, j2, by="Term")
LosAngeles$City <- "Los Angeles"
LosAngeles$State <- "CA"
LosAngeles <- replace_na(LosAngeles, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(LosAngeles, path = "LosAngeles.csv")
#Mesa
a <- read_csv("MesaSkills1.csv")
b <- read_csv("MesaSkills2.csv")
c <- read_csv("MesaSkills3.csv")
d <- read_csv("MesaSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Mesa <- full_join(j1, j2, by="Term")
Mesa$City <- "Mesa"
Mesa$State <- "AZ"
Mesa <- replace_na(Mesa, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Mesa, path = "Mesa.csv")
#Miami
a <- read_csv("MiamiSkills1.csv")
b <- read_csv("MiamiSkills2.csv")
c <- read_csv("MiamiSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Miami <- full_join(j1, c, by="Term")
Miami$City <- "Miami"
Miami$State <- "FL"
Miami <- replace_na(Miami, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Miami, path = "Miami.csv")
#Minneapolis
a <- read_csv("MinneapolisSkills1.csv")
b <- read_csv("MinneapolisSkills2.csv")
c <- read_csv("MinneapolisSkills3.csv")
d <- read_csv("MinneapolisSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Minneapolis <- full_join(j1, j2, by="Term")
Minneapolis$City <- "Minneapolis"
Minneapolis$State <- "MN"
Minneapolis <- replace_na(Minneapolis, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Minneapolis, path = "Minneapolis.csv")
#Nashville
a <- read_csv("NashvilleSkills1.csv")
b <- read_csv("NashvilleSkills2.csv")
c <- read_csv("NashvilleSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Nashville <- full_join(j1, c, by="Term")
Nashville$City <- "Nashville"
Nashville$State <- "TN"
Nashville <- replace_na(Nashville, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Nashville, path = "Nashville.csv")
#Newark
a <- read_csv("NewarkSkills1.csv")
b <- read_csv("NewarkSkills2.csv")
c <- read_csv("NewarkSkills3.csv")
d <- read_csv("NewarkSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Newark <- full_join(j1, j2, by="Term")
Newark$City <- "Newark"
Newark$State <- "NJ"
Newark <- replace_na(Newark, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Newark, path = "Newark.csv")
#Oakland
a <- read_csv("OaklandSkills1.csv")
b <- read_csv("OaklandSkills2.csv")
c <- read_csv("OaklandSkills3.csv")
d <- read_csv("OaklandSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Oakland <- full_join(j1, j2, by="Term")
Oakland$City <- "Oakland"
Oakland$State <- "CA"
Oakland <- replace_na(Oakland, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Oakland, path = "Oakland.csv")
#Philadelphia
a <- read_csv("PhiladelphiaSkills1.csv")
b <- read_csv("PhiladelphiaSkills2.csv")
c <- read_csv("PhiladelphiaSkills3.csv")
d <- read_csv("PhiladelphiaSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Philadelphia <- full_join(j1, j2, by="Term")
Philadelphia$City <- "Philadelphia"
Philadelphia$State <- "PA"
Philadelphia <- replace_na(Philadelphia, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Philadelphia, path = "Philadelphia.csv")
#Pittsburgh
a <- read_csv("PittsburghSkills1.csv")
b <- read_csv("PittsburghSkills2.csv")
c <- read_csv("PittsburghSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Pittsburgh <- full_join(j1, c, by="Term")
Pittsburgh$City <- "Pittsburgh"
Pittsburgh$State <- "PA"
Pittsburgh <- replace_na(Pittsburgh, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Pittsburgh, path = "Pittsburgh.csv")
#Phoenix
a <- read_csv("PhoenixSkills1.csv")
b <- read_csv("PhoenixSkills2.csv")
c <- read_csv("PhoenixSkills3.csv")
d <- read_csv("PhoenixSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Phoenix <- full_join(j1, j2, by="Term")
Phoenix$City <- "Phoenix"
Phoenix$State <- "AZ"
Phoenix <- replace_na(Phoenix, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Phoenix, path = "Phoenix.csv")
#Plano
a <- read_csv("PlanoSkills1.csv")
b <- read_csv("PlanoSkills2.csv")
c <- read_csv("PlanoSkills3.csv")
d <- read_csv("PlanoSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Plano <- full_join(j1, j2, by="Term")
Plano$City <- "Plano"
Plano$State <- "TX"
Plano <- replace_na(Plano, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Plano, path = "Plano.csv")
#Portland
a <- read_csv("PortlandSkills1.csv")
b <- read_csv("PortlandSkills2.csv")
c <- read_csv("PortlandSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Portland <- full_join(j1, c, by="Term")
Portland$City <- "Portland"
Portland$State <- "OR"
Portland <- replace_na(Portland, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Portland, path = "Portland.csv")
#Raleigh
a <- read_csv("RaleighSkills1.csv")
b <- read_csv("RaleighSkills2.csv")
c <- read_csv("RaleighSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Raleigh <- full_join(j1, c, by="Term")
Raleigh$City <- "Raleigh"
Raleigh$State <- "NC"
Raleigh <- replace_na(Raleigh, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Raleigh, path = "Raleigh.csv")
#SanDiego
a <- read_csv("SanDiegoSkills1.csv")
b <- read_csv("SanDiegoSkills2.csv")
c <- read_csv("SanDiegoSkills3.csv")
d <- read_csv("SanDiegoSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
SanDiego <- full_join(j1, j2, by="Term")
SanDiego$City <- "SanDiego"
SanDiego$State <- "CA"
SanDiego <- replace_na(SanDiego, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(SanDiego, path = "SanDiego.csv")
#SanFrancisco
a <- read_csv("SanFranciscoSkills1.csv")
b <- read_csv("SanFranciscoSkills2.csv")
c <- read_csv("SanFranciscoSkills3.csv")
d <- read_csv("SanFranciscoSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
SanFrancisco <- full_join(j1, j2, by="Term")
SanFrancisco$City <- "San Francisco"
SanFrancisco$State <- "CA"
SanFrancisco <- replace_na(SanFrancisco, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(SanFrancisco, path = "SanFrancisco.csv")
#SanJose
a <- read_csv("SanJoseSkills1.csv")
b <- read_csv("SanJoseSkills2.csv")
c <- read_csv("SanJoseSkills3.csv")
d <- read_csv("SanJoseSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
SanJose <- full_join(j1, j2, by="Term")
SanJose$City <- "San Jose"
SanJose$State <- "CA"
SanJose <- replace_na(SanJose, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(SanJose, path = "SanJose.csv")
#SantaAna
a <- read_csv("SantaAnaSkills1.csv")
b <- read_csv("SantaAnaSkills2.csv")
c <- read_csv("SantaAnaSkills3.csv")
d <- read_csv("SantaAnaSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
SantaAna <- full_join(j1, j2, by="Term")
SantaAna$City <- "Santa Ana"
SantaAna$State <- "CA"
SantaAna <- replace_na(SantaAna, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(SantaAna, path = "SantaAna.csv")
#Scottsdale
a <- read_csv("ScottsdaleSkills1.csv")
b <- read_csv("ScottsdaleSkills2.csv")
c <- read_csv("ScottsdaleSkills3.csv")
d <- read_csv("ScottsdaleSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Scottsdale <- full_join(j1, j2, by="Term")
Scottsdale$City <- "Scottsdale"
Scottsdale$State <- "AZ"
Scottsdale <- replace_na(Scottsdale, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Scottsdale, path = "Scottsdale.csv")
#Seattle
a <- read_csv("SeattleSkills1.csv")
b <- read_csv("SeattleSkills2.csv")
c <- read_csv("SeattleSkills3.csv")
d <- read_csv("SeattleSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Seattle <- full_join(j1, j2, by="Term")
Seattle$City <- "Seattle"
Seattle$State <- "WA"
Seattle <- replace_na(Seattle, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Seattle, path = "Seattle.csv")
#St.Louis
a <- read_csv("St.LouisSkills1.csv")
b <- read_csv("St.LouisSkills2.csv")
c <- read_csv("St.LouisSkills3.csv")
d <- read_csv("St.LouisSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
St.Louis <- full_join(j1, j2, by="Term")
St.Louis$City <- "St. Louis"
St.Louis$State <- "MO"
St.Louis <- replace_na(St.Louis, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(St.Louis, path = "St.Louis.csv")
#St.Paul
a <- read_csv("St.PaulSkills1.csv")
b <- read_csv("St.PaulSkills2.csv")
c <- read_csv("St.PaulSkills3.csv")
d <- read_csv("St.PaulSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
St.Paul <- full_join(j1, j2, by="Term")
St.Paul$City <- "St. Paul"
St.Paul$State <- "MN"
St.Paul <- replace_na(St.Paul, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(St.Paul, path = "St.Paul.csv")
#St.Petersburg
a <- read_csv("St.PetersburgSkills1.csv")
b <- read_csv("St.PetersburgSkills2.csv")
c <- read_csv("St.PetersburgSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
St.Petersburg <- full_join(j1, c, by="Term")
St.Petersburg$City <- "St. Petersburg"
St.Petersburg$State <- "FL"
St.Petersburg <- replace_na(St.Petersburg, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(St.Petersburg, path = "St.Petersburg.csv")
#Tampa
a <- read_csv("TampaSkills1.csv")
b <- read_csv("TampaSkills2.csv")
c <- read_csv("TampaSkills3.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
j1 <- full_join(a, b, by="Term")
Tampa <- full_join(j1, c, by="Term")
Tampa$City <- "Tampa"
Tampa$State <- "FL"
Tampa <- replace_na(Tampa, list(DAnalyst=0, DScientist=0, DArchitect=0))
write_csv(Tampa, path = "Tampa.csv")
#Washington
a <- read_csv("WashingtonSkills1.csv")
b <- read_csv("WashingtonSkills2.csv")
c <- read_csv("WashingtonSkills3.csv")
d <- read_csv("WashingtonSkills4.csv")
names(a) <- c("Term", "DAnalyst")
names(b) <- c("Term", "DScientist")
names(c) <- c("Term", "DArchitect")
names(d) <- c("Term", "DEngineer")
j1 <- full_join(a, b, by="Term")
j2 <- full_join(c, d, by="Term")
Washington <- full_join(j1, j2, by="Term")
Washington$City <- "Washington"
Washington$State <- "DC"
Washington <- replace_na(Washington, list(DAnalyst=0, DScientist=0, DEngineer=0, DArchitect=0))
write_csv(Washington, path = "Washington.csv")
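The per-city block above is repeated verbatim for every city. It can be collapsed into a small helper (a sketch, assuming the `CitySkillsN.csv` naming convention used above, with readr, dplyr and tidyr loaded):

```r
# Read the four per-title skill files for one city, join them on Term,
# tag the city/state, fill missing counts with 0 and write the result.
wrangle_city <- function(prefix, city, state) {
  titles <- c("DAnalyst", "DScientist", "DArchitect", "DEngineer")
  dfs <- lapply(seq_along(titles), function(i) {
    df <- read_csv(paste0(prefix, "Skills", i, ".csv"))
    names(df) <- c("Term", titles[i])
    df
  })
  out <- Reduce(function(x, y) full_join(x, y, by = "Term"), dfs)
  out$City <- city
  out$State <- state
  out <- replace_na(out, as.list(setNames(rep(0, 4), titles)))
  write_csv(out, path = paste0(prefix, ".csv"))
  out
}

Seattle <- wrangle_city("Seattle", "Seattle", "WA")
```

Cities with only three skill files (like Tampa) would need a minor variant, but this removes most of the duplication.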
All the .csv files created above can be found here.
Combining all cities:
master_39cities <- union(Anaheim, Arlington) %>% union(Atlanta) %>%
union(Aurora) %>% union(Austin) %>% union(Baltimore) %>%
union(Boston) %>% union(Boulder) %>% union(Chicago) %>%
union(Dallas) %>% union(Denver) %>% union(Fremont) %>%
union(Garland) %>% union(Irving) %>% union(Jersey) %>%
union(LongBeach) %>% union(LosAngeles) %>% union(Minneapolis) %>%
union(Newark) %>% union(NewYork) %>% union(Oakland) %>%
union(Philadelphia) %>% union(Plano) %>% union(SanFrancisco) %>%
union(SanJose) %>% union(SantaAna) %>% union(Seattle) %>%
union(St.Louis) %>% union(St.Paul) %>% union(Washington) %>%
union(Chandler) %>% union(Columbus) %>% union(Gilbert) %>%
union(Glendale) %>% union(Irvine) %>% union(Mesa) %>%
union(Phoenix) %>% union(SanDiego) %>% union(Scottsdale)
write_csv(master_39cities, path="master_39cities.csv")
After the lengthy process of wrangling and combining the data, I now have three .csv files:
master_39cities
- Term = name of the data science tool
- DAnalyst = % of job ads for data analysts that contained the Term
- DScientist = % of job ads for data scientists that contained the Term
- DArchitect = % of job ads for data architects that contained the Term
- DEngineer = % of job ads for data engineers that contained the Term
- City = U.S. city
- State = U.S. state

master_53cities
- Same columns as master_39cities, covering all 53 cities.

num_posting_US
- City, State
- DAst_num = number of job ads for Data Analysts in each of the 53 cities
- DS_num = number of job ads for Data Scientists in each of the 53 cities
- DAr_num = number of job ads for Data Architects in each of the 53 cities
- DE_num = number of job ads for Data Engineers in each of the 53 cities

After the data collection process, I narrowed down my research questions to the following.
library(ggplot2)
library(ggrepel)
library(gridExtra)
library(stringr)
library(wordcloud)
library(RColorBrewer)
library(tm)
library(SnowballC)
Wrangling of num_posting data:
num_posting <- read_csv("num_posting_US.csv")
#dataset that contains the number of job postings for each job title in each of the 53 cities
num_posting <- num_posting %>%
separate(col=`City, State`, into=c("City", "State"), sep=",")
names(num_posting) <- c("City", "State", "DAnalyst", "DScientist", "DArchitect", "DEngineer")
num_posting <- replace_na(num_posting, list(DEngineer=0)) #The NAs for DEngineer come from cities with fewer than 10 job postings; for ease of analysis, I will treat them as zero.
num_posting_long <- num_posting %>%
gather(job_title, num_posting, DAnalyst:DEngineer) #tidy data
num_posting %>%
ggplot() +
geom_line(aes(City, DAnalyst, group=1, color="Data Analyst")) +
geom_line(aes(City, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(City, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(City, DEngineer, group=1, color="Data Engineer")) +
ylab("Number of job postings") +
ggtitle("Number of job ads in 53 U.S. cities") +
theme( text=element_text(size=22),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
You can see that in almost all of the 53 cities, Data Analyst roles had the most job ads, followed by Data Scientist roles; San Francisco, San Jose and Seattle were the exceptions.
The numbers of job ads for Data Engineers and Data Architects were similar, except in New York, Newark, Jersey City, San Jose, San Francisco, Seattle, Oakland and Fremont, where Data Engineer ads significantly outnumbered Data Architect ads.
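Since `num_posting_long` is already tidy, the same plot can be drawn with a single `geom_line()` mapped to `job_title`, avoiding the four repeated layers (a sketch using the objects defined above):

```r
num_posting_long %>%
  ggplot(aes(City, num_posting, group = job_title, color = job_title)) +
  geom_line() +
  ylab("Number of job postings") +
  ggtitle("Number of job ads in 53 U.S. cities") +
  theme(text = element_text(size = 22),
        axis.text.x = element_text(angle = 90, vjust = 1),
        plot.title = element_text(size = rel(1.1)),
        legend.text = element_text(size = 20))
```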
num_posting_long %>%
group_by(City) %>%
summarize(total_num_posting = sum(num_posting)) %>%
ggplot(aes(City, total_num_posting))+
geom_point(color="purple") +
ylab("Number of job postings") +
ggtitle("Total number of job postings for Data Analysts, Data Scientists, Data Architects & Data Engineers") +
theme( text=element_text(size=22),
axis.text.x=element_text(angle=90, vjust=1),
legend.text = element_text(size = 20))
Newark, Jersey City, New York, Oakland, Washington, Fremont and San Francisco each had over 1,000 job ads for Data Analysts, Data Scientists, Data Architects and Data Engineers combined. San Jose had over 800, while Boston and Seattle had over 500.
num_posting_long %>%
group_by(City) %>%
summarize(total_num_posting = sum(num_posting)) %>%
tbl_df() %>%
arrange(desc(total_num_posting))
## Source: local data frame [53 x 2]
##
## City total_num_posting
## (chr) (dbl)
## 1 Newark 1410
## 2 Jersey City 1394
## 3 New York 1333
## 4 Oakland 1066
## 5 Washington 1059
## 6 Fremont 1041
## 7 San Francisco 1014
## 8 San Jose 817
## 9 Boston 657
## 10 Seattle 562
## .. ... ...
Creating a master dataset for Canada:
You can find the datasets for Canada here.
canada1 <- read_csv("canada1.csv")
canada2 <- read_csv("canada2.csv")
canada3 <- read_csv("canada3.csv")
canada4 <- read_csv("canada4.csv")
names(canada1) <- c("Term", "DAnalyst")
names(canada2) <- c("Term", "DScientist")
names(canada3) <- c("Term", "DArchitect")
names(canada4) <- c("Term", "DEngineer")
j1 <- full_join(canada1, canada2, by="Term")
j2 <- full_join(canada3, canada4, by="Term")
canada <- full_join(j1, j2, by="Term")
canada <- replace_na(canada, replace=list(DAnalyst=0, DScientist=0, DArchitect=0, DEngineer=0))
canada_long <- canada %>%
gather(job_title, percentage, DAnalyst: DEngineer) #tidy
canada %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst")) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
xlab("") +
ylab("Percentage appearing in job ads") +
ggtitle("Data Science Tools in demand in Canada") +
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
For data analysts, 43.2% of job ads asked for SQL, 41.18% for Excel, 16.1% for R, 16.1% for Tableau, 12.3% for Python and 10.5% for SAS.
canada %>%
select(Term, DAnalyst) %>%
arrange(desc(DAnalyst))
## Source: local data frame [27 x 2]
##
## Term DAnalyst
## (chr) (dbl)
## 1 SQL 43.222506
## 2 Excel 41.176471
## 3 R 16.112532
## 4 Tableau 16.112532
## 5 Python 12.276215
## 6 SAS 10.485934
## 7 JavaScript 10.230179
## 8 Perl 8.695652
## 9 Java 8.439898
## 10 Hadoop 6.649616
## .. ... ...
For data scientists, 34.4% of job ads asked for R, 34.4% for Python, 30.5% for SQL, 23.4% for Java, 19.5% for Tableau, 18.8% for Hadoop and 11.7% for Excel.
canada %>%
select(Term, DScientist) %>%
arrange(desc(DScientist))
## Source: local data frame [27 x 2]
##
## Term DScientist
## (chr) (dbl)
## 1 R 34.37500
## 2 Python 34.37500
## 3 SQL 30.46875
## 4 Java 23.43750
## 5 Tableau 19.53125
## 6 Hadoop 18.75000
## 7 SAS 17.96875
## 8 Spark 17.18750
## 9 JavaScript 13.28125
## 10 Excel 11.71875
## .. ... ...
For data architects, 32.7% asked for SQL, 13.9% for Java, 13.9% for Hadoop, 11.9% for Tableau, 11.9% for Python and only 5.9% for R.
canada %>%
select(Term, DArchitect) %>%
arrange(desc(DArchitect))
## Source: local data frame [27 x 2]
##
## Term DArchitect
## (chr) (dbl)
## 1 SQL 32.673267
## 2 Java 13.861386
## 3 Hadoop 13.861386
## 4 Tableau 11.881188
## 5 Python 11.881188
## 6 JavaScript 9.900990
## 7 C++ 7.920792
## 8 Hive 7.920792
## 9 SAS 6.930693
## 10 R 5.940594
## .. ... ...
For data engineers, 66.7% asked for Python, 63.0% for Hadoop, 40.7% for Spark, 37.0% for SQL & Scala, and 18.5% for Java.
canada %>%
select(Term, DEngineer) %>%
arrange(desc(DEngineer))
## Source: local data frame [27 x 2]
##
## Term DEngineer
## (chr) (dbl)
## 1 Python 66.66667
## 2 Hadoop 62.96296
## 3 Spark 40.74074
## 4 SQL 37.03704
## 5 Scala 37.03704
## 6 Hive 33.33333
## 7 Java 18.51852
## 8 HBase 18.51852
## 9 NoSQL 14.81481
## 10 Pig 14.81481
## .. ... ...
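Instead of four separate `select()`/`arrange()` calls, the long-format data can produce the top terms for all four job titles at once (a sketch using `canada_long` from above and dplyr's `top_n()`):

```r
canada_long %>%
  group_by(job_title) %>%
  top_n(5, percentage) %>%
  arrange(job_title, desc(percentage))
```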
Creating a master dataset for USA nationwide count:
You can find the datasets for the U.S. here.
usa1 <- read_csv("national1.csv")
usa2 <- read_csv("national2.csv")
usa3 <- read_csv("national3.csv")
usa4 <- read_csv("national4.csv")
names(usa1) <- c("Term", "DAnalyst")
names(usa2) <- c("Term", "DScientist")
names(usa3) <- c("Term", "DArchitect")
names(usa4) <- c("Term", "DEngineer")
j1 <- full_join(usa1, usa2, by="Term")
j2 <- full_join(usa3, usa4, by="Term")
usa <- full_join(j1, j2, by="Term")
USA <- replace_na(usa, replace=list(DAnalyst=0, DScientist=0, DArchitect=0, DEngineer=0))
USA_long <- USA %>%
gather(job_title, percentage, DAnalyst: DEngineer) #tidy
USA %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst")) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
xlab("") +
ylab("Percentage appearing in job ads") +
ggtitle("Data Science Tools in demand in USA") +
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
For data analysts, 56.2% of job ads asked for Excel, 27.5% for SQL, 12.5% for Tableau, 12.4% for R, 10.3% for Java and 5.5% for SAS.
USA %>%
select(Term, DAnalyst) %>%
arrange(desc(DAnalyst))
## Source: local data frame [30 x 2]
##
## Term DAnalyst
## (chr) (dbl)
## 1 Excel 56.221854
## 2 SQL 27.470965
## 3 Tableau 12.491112
## 4 R 12.396302
## 5 Java 10.263096
## 6 SAS 5.451529
## 7 Hadoop 3.602749
## 8 Python 2.939085
## 9 SPSS 2.844276
## 10 JavaScript 2.630955
## .. ... ...
For data scientists, 37.7% of job ads asked for Python, 36.6% for R, 29.4% for SQL, 27.0% for Hadoop, 24.9% for Java, 18.7% for Matlab, 12.8% for SAS and only 4.6% for Excel.
USA %>%
select(Term, DScientist) %>%
arrange(desc(DScientist))
## Source: local data frame [30 x 2]
##
## Term DScientist
## (chr) (dbl)
## 1 Python 37.65892
## 2 R 36.56291
## 3 SQL 29.41692
## 4 Hadoop 27.04954
## 5 Java 24.85752
## 6 Matlab 18.67602
## 7 Spark 13.15213
## 8 SAS 12.80140
## 9 Hive 12.71372
## 10 C++ 11.74923
## .. ... ...
For data architects, 45.2% asked for SQL, 18.0% for Hadoop, 17.2% for Java, 16.1% for Tableau, 9.1% for Python, 7.9% for R.
USA %>%
select(Term, DArchitect) %>%
arrange(desc(DArchitect))
## Source: local data frame [30 x 2]
##
## Term DArchitect
## (chr) (dbl)
## 1 SQL 45.209581
## 2 Hadoop 17.964072
## 3 Java 17.215569
## 4 Tableau 16.092814
## 5 Spark 10.179641
## 6 NoSQL 10.029940
## 7 Python 9.131737
## 8 Hive 8.532934
## 9 R 7.859281
## 10 JavaScript 7.784431
## .. ... ...
For data engineers, 47.5% asked for SQL, 36.6% for Hadoop, 30.6% for Python, 28.7% for Java and 9.1% for R.
USA %>%
select(Term, DEngineer) %>%
arrange(desc(DEngineer))
## Source: local data frame [30 x 2]
##
## Term DEngineer
## (chr) (dbl)
## 1 SQL 47.48031
## 2 Hadoop 36.53543
## 3 Python 30.62992
## 4 Java 28.66142
## 5 Spark 20.78740
## 6 Hive 17.79528
## 7 NoSQL 12.99213
## 8 HBase 11.96850
## 9 Pig 11.10236
## 10 Scala 11.02362
## .. ... ...
For data analysts, Excel seems to be more popular in the U.S. than in Canada. Over 40% of Canadian ads asked for SQL, versus only 27.5% in the U.S. Fewer than 15% of U.S. ads asked for R, and only about 3% for Python.
For data scientists, Python edges out R in the U.S. by about 1.1 percentage points, while the two are tied in Canada. SQL is in strong demand in both countries.
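To compare the two countries directly, the `canada` and `USA` tables can be joined on `Term` (a sketch; the `suffix` argument needs dplyr >= 0.5, and the `.ca`/`.us` column suffixes are my own choice):

```r
cross <- full_join(canada, USA, by = "Term", suffix = c(".ca", ".us"))
cross %>%
  select(Term, DScientist.ca, DScientist.us) %>%
  arrange(desc(DScientist.us))
```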
Now, let’s see if the trends are different for the U.S. cities data compared to nationwide.
Loading the datasets with the individual cities’ data:
master_39cities <- read_csv("master_39cities.csv")
master_39cities_long <- master_39cities %>%
gather(job_title, percentage, DAnalyst: DEngineer) #tidydata
Now, let’s look at the data separately for each job title.
g1 <- master_39cities %>%
ggplot(aes(Term, DAnalyst)) +
geom_boxplot(color="red") +
xlab("") +
ylab("Percentage of appearance in job ads") +
ggtitle("Data Science Tools in demand for Data Analysts")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g2 <- master_39cities %>%
ggplot(aes(Term, DScientist)) +
geom_boxplot(color="purple") +
xlab("") +
ylab("Percentage of appearance in job ads") +
ggtitle("Data Science Tools in demand for Data Scientists")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g3 <- master_39cities %>%
ggplot(aes(Term, DArchitect)) +
geom_boxplot(color="green") +
xlab("") +
ylab("Percentage of appearance in job ads") +
ggtitle("Data Science Tools in demand for Data Architects")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g4 <- master_39cities %>%
ggplot(aes(Term, DEngineer)) +
geom_boxplot(color="turquoise") +
xlab("") +
ylab("Percentage of appearance in job ads") +
ggtitle("Data Science Tools in demand for Data Engineers")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g1
g2
g3
g4
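The four boxplots can also be produced in one call by faceting the tidy `master_39cities_long` data on `job_title`, at the cost of the per-plot colors:

```r
master_39cities_long %>%
  ggplot(aes(Term, percentage)) +
  geom_boxplot(color = "purple") +
  facet_wrap(~job_title) +
  xlab("") +
  ylab("Percentage of appearance in job ads") +
  ggtitle("Data Science Tools in demand by job title") +
  theme(text = element_text(size = 20),
        axis.text.x = element_text(angle = 90, vjust = 1),
        plot.title = element_text(size = rel(1.1)))
```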
g1<- master_39cities%>%
filter (City == "New York" | City=="Newark" | City =="Jersey" | City =="Washington") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g2 <- master_39cities%>%
filter (City == "Boston" | City=="Oakland" | City =="San Francisco" | City =="Fremont") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g3 <- master_39cities%>%
filter (City == "Los Angeles" | City=="Atlanta" | City =="LongBeach" | City =="Irving") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g4 <- master_39cities%>%
filter (City == "San Jose" | City=="Chicago" | City =="Anaheim" | City =="Dallas") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g5 <- master_39cities%>%
filter (City == "Garland" | City=="Plano" | City =="Arlington" | City =="Philadelphia") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g6 <- master_39cities%>%
filter (City == "Seattle" | City=="Minneapolis" | City =="St. Paul" | City =="Baltimore") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g7 <- master_39cities%>%
filter (City == "Denver" | City=="Santa Ana" | City =="St. Louis" | City =="Aurora") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g8 <- master_39cities%>%
filter (City == "Mesa" | City=="Phoenix" | City =="Scottsdale" | City =="Irvine") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g9 <- master_39cities%>%
filter (City == "Chandler" | City=="Gilbert" | City =="Glendale" | City =="Boulder") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g10 <- master_39cities%>%
filter (City == "Austin" | City=="Columbus" | City =="SanDiego") %>%
ggplot() +
geom_line(aes(Term, DAnalyst, group=1, color="Data Analyst" )) +
geom_line(aes(Term, DScientist, group=1, color="Data Scientist")) +
geom_line(aes(Term, DArchitect, group=1, color="Data Architect")) +
geom_line(aes(Term, DEngineer, group=1, color="Data Engineer")) +
facet_wrap(~City, nrow=2) +
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme( text=element_text(size=20),
axis.text.x=element_text(angle=90, vjust=1),
plot.title = element_text(size = rel(1.1)),
legend.text = element_text(size = 20))
g1
g2
g3
g4
g5
g6
g7
g8
g9
g10
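The ten near-identical faceted plots differ only in their list of cities, so a small helper removes the duplication (a sketch built on the objects above):

```r
plot_cities <- function(cities) {
  master_39cities %>%
    filter(City %in% cities) %>%
    ggplot() +
    geom_line(aes(Term, DAnalyst, group = 1, color = "Data Analyst")) +
    geom_line(aes(Term, DScientist, group = 1, color = "Data Scientist")) +
    geom_line(aes(Term, DArchitect, group = 1, color = "Data Architect")) +
    geom_line(aes(Term, DEngineer, group = 1, color = "Data Engineer")) +
    facet_wrap(~City) +
    xlab("") +
    ylab("Percentage of appearance in job postings") +
    ggtitle("Percentage of job ads with a data science tool") +
    theme(text = element_text(size = 20),
          axis.text.x = element_text(angle = 90, vjust = 1),
          plot.title = element_text(size = rel(1.1)),
          legend.text = element_text(size = 20))
}

plot_cities(c("New York", "Newark", "Jersey", "Washington"))
```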
For data scientists, Python is winning over R in West Coast cities such as San Francisco, San Jose, Oakland and Seattle. The picture was more even on the East Coast: Jersey City and Boston ads asked for Python more, while New York and Newark asked for R more.
For data analysts, almost all indeed job ads from the 39 U.S. cities asked for Excel and SQL.
More job ads asked for Python and Java for data engineers than for data architects.
The stackoverflow data had 362 job ads, with about 60% of them from the U.S. and 40% from other countries around the world.
Let’s see if the trends that we observed above for U.S. are similar “internationally”.
load("data_scientist.rda")
stackoverflow <- data_scientist %>%
select(JobId, Title, Tags, LocationString, Description, CountryCode, StateCode)
Description <- stackoverflow %>%
select(Description)
data_scientist %>% select(CountryCode) %>% filter(CountryCode=="US") %>% nrow() #223 job ads from US
## [1] 223
data_scientist %>% select(CountryCode) %>% filter(CountryCode!="US") %>% nrow() #139 job ads from other countries around the world
## [1] 139
count1 <- function(term, lang) { #Counting just one variation of the term
lang = str_count(Description$Description, term) %>% data.frame()
names(lang) <- "count"
lang <- lang %>% mutate(count=ifelse(count == 0, 0, 1))
#Scenario 1: if the count is 0, the term didn't appear in the job ad.
#Scenario 2: if the count is non-zero, the term appeared one or more times in the job ad.
#Since we are only interested in the proportion of job ads that contain the term, the count is capped at 1 in scenario 2.
lang
}
count2 <- function(term1, term2, lang) { #Counting 2 variations of the term
a = str_count(Description$Description, term1) %>% data.frame()
names(a) <- "count"
b = str_count(Description$Description, term2) %>% data.frame()
names(b) <- "count"
lang <- a + b
lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
lang
}
count3 <- function(term1, term2, term3, lang) { #Counting 3 variations of the term
a = str_count(Description$Description, term1) %>% data.frame()
names(a) <- "count"
b = str_count(Description$Description, term2) %>% data.frame()
names(b) <- "count"
c = str_count(Description$Description, term3) %>% data.frame()
names(c) <- "count"
lang <- a + b + c
lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
lang
}
count7 <- function(term1, term2, term3, term4, term5, term6, term7, lang) { #Counting 7 variations of the term
a = str_count(Description$Description, term1) %>% data.frame()
names(a) <- "count"
b = str_count(Description$Description, term2) %>% data.frame()
names(b) <- "count"
c = str_count(Description$Description, term3) %>% data.frame()
names(c) <- "count"
d = str_count(Description$Description, term4) %>% data.frame()
names(d) <- "count"
e = str_count(Description$Description, term5) %>% data.frame()
names(e) <- "count"
f = str_count(Description$Description, term6) %>% data.frame()
names(f) <- "count"
g = str_count(Description$Description, term7) %>% data.frame()
names(g) <- "count"
lang <- a + b + c +d + e+ f + g
lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
lang
}
count3_fixed3 <- function(term1, term2, term3, lang){ #Counting 3 fixed variations of the term
a <- grepl(term1, Description$Description) %>% data.frame()
#When I was trying to count R, `str_count` with a plain pattern matched every "R" inside other words, not just the standalone word "R".
#Therefore, I switched to grepl with word-boundary patterns (e.g. "\\<R\\>"), which matches only the whole word.
names(a) <- "count"
a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
b <- grepl(term2, Description$Description) %>% data.frame()
names(b) <- "count"
b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
c <- grepl(term3, Description$Description) %>% data.frame()
names(c) <- "count"
c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
lang <- a+b+c
lang <- lang %>% mutate(count=ifelse(count == 0, 0, 1))
lang
}
count6_fixed6 <- function(term1, term2, term3, term4, term5, term6, lang){ #Counting 6 fixed variations of the term
a <- grepl(term1, Description$Description) %>% data.frame()
names(a) <- "count"
a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
b <- grepl(term2, Description$Description) %>% data.frame()
names(b) <- "count"
b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
c <- grepl(term3, Description$Description) %>% data.frame()
names(c) <- "count"
c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
d <- grepl(term4, Description$Description) %>% data.frame()
names(d) <- "count"
d <- d %>% mutate(count=ifelse(count==TRUE, 1, 0))
e <- grepl(term5, Description$Description) %>% data.frame()
names(e) <- "count"
e <- e %>% mutate(count=ifelse(count==TRUE, 1, 0))
f <- grepl(term6, Description$Description) %>% data.frame()
names(f) <- "count"
f <- f %>% mutate(count=ifelse(count==TRUE, 1, 0))
lang <- a+b+c+d+e+f
lang <- lang %>% mutate(count=ifelse(count == 0, 0, 1))
lang
}
count9_fixed5 <- function(term1, term2, term3, term4, #term1-5 are fixed words
term5, term6, term7, term8, term9, lang) {
a <- grepl(term1, Description$Description) %>% data.frame()
names(a) <- "count"
a <- a %>% mutate(count=ifelse(count==TRUE, 1, 0))
b <- grepl(term2, Description$Description) %>% data.frame()
names(b) <- "count"
b <- b %>% mutate(count=ifelse(count==TRUE, 1, 0))
c <- grepl(term3, Description$Description) %>% data.frame()
names(c) <- "count"
c <- c %>% mutate(count=ifelse(count==TRUE, 1, 0))
d <- grepl(term4, Description$Description) %>% data.frame()
names(d) <- "count"
d <- d %>% mutate(count=ifelse(count==TRUE, 1, 0))
e <- grepl(term5, Description$Description) %>% data.frame()
names(e) <- "count"
e <- e %>% mutate(count=ifelse(count==TRUE, 1, 0))
f = str_count(Description$Description, term6) %>% data.frame()
names(f) <- "count"
g = str_count(Description$Description, term7) %>% data.frame()
names(g) <- "count"
h = str_count(Description$Description, term8) %>% data.frame()
names(h) <- "count"
i = str_count(Description$Description, term9) %>% data.frame()
names(i) <- "count"
lang <- a + b + c +d + e+ f + g + h+ i
lang <- lang %>% mutate(count=ifelse(count== 0, 0, 1))
lang
}
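All of the `count*` helpers follow the same pattern: match each variation of a term, add the matches, and clip the result to 0/1. They can be replaced by one function taking any number of patterns. This sketch uses `grepl` throughout, so both plain strings and `\\<...\\>` word-boundary patterns work:

```r
count_terms <- function(...) {
  patterns <- c(...)
  # One logical column per pattern: does the ad match this variation?
  hits <- sapply(patterns, function(p) grepl(p, Description$Description))
  # A job ad counts once if any variation of the term appears.
  data.frame(count = as.integer(rowSums(as.matrix(hits)) > 0))
}

Python <- count_terms("Python", "python")
SQL <- count_terms("\\<SQL\\>", "\\<sql\\>")
```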
stack_count contains columns indicating whether a particular job ad mentioned the term (yes = 1, no = 0):
R <- count3_fixed3("\\<R\\>", "\\<R.\\>", "\\<R,\\>", R)
Python <- count2("Python", "python", Python)
Java <- count2("Java", "java", Java)
C_plus_plus <- count2("C\\+\\+", "c\\+\\+", C_plus_plus)
Ruby <- count2("Ruby", "ruby", Ruby)
Perl <- count2("Perl", "perl", Perl)
Matlab <- count2("Matlab", "matlab", Matlab)
Javascript <- count2("Javascript", "javascript", Javascript)
Scala <- count2("Scala", "scala", Scala)
Excel <- count2("Excel", "excel",Excel)
Tableau <- count2("Tableau","tableau", Tableau)
D3.js <- count2("D3.js","d3.js", D3.js)
SAS <- count2("SAS","sas", SAS)
SPSS <- count2("SPSS","spss", SPSS)
D3 <- count2("d3", "D3", D3)
Hadoop <- count2("Hadoop", "hadoop", Hadoop)
MapReduce <- count2("MapReduce","mapreduce", MapReduce)
Spark <- count2("Spark", "spark", Spark)
Pig <- count2("Pig","pig", Pig)
Hive <- count2("Hive", "hive", Hive)
Shark <- count2("Shark", "shark", Shark)
Oozie <- count2("Oozie", "oozie", Oozie)
Zookeeper <- count2("Zookeeper", "zookeeper", Zookeeper)
Flume <- count2("Flume", "flume", Flume)
Mahout <- count2("Mahout", "mahout", Mahout)
SQL <- count6_fixed6("\\<SQL\\>", "\\<sql\\>","\\<SQL.\\>", "\\<SQL,\\>", "\\<sql.\\>", "\\<sql,\\>", SQL)
NoSQL <- count2("NoSQL", "nosql", NoSQL)
HBase <- count2("HBase", "hbase", HBase)
Cassandra <- count2("Cassandra", "cassandra", Cassandra)
MongoDB <- count2("MongoDB","mongodb", MongoDB)
MachineLearning <- count2("Machine Learning", "machine learning", MachineLearning)
BigData <- count3("Big Data", "big data", "Big data", BigData)
PhD <- count3("PhD", "Ph.D", "Ph.D.", PhD)
Bachelors <- count7("Undergraduate", "undergraduate", "BA", "BS", "BSc", "Bachelor\'s", "Bachelors", Bachelors)
Masters <- count9_fixed5("\\<Ms\\>", "\\<MS\\>", "\\<M.S.\\>", "\\<MA\\>", "\\<MBA\\>", "MS degree","Master\'s", "Master", "Masters", Masters)
stack_count <- cbind(R, Python, Java, C_plus_plus, Ruby, Perl, Matlab, Javascript, Scala, Excel, Tableau, D3.js, SAS, SPSS, D3, Hadoop, MapReduce, Spark, Pig, Hive, Shark, Oozie, Zookeeper, Flume, Mahout, SQL, NoSQL, HBase, Cassandra, MongoDB, MachineLearning, BigData, PhD, Bachelors, Masters)
names(stack_count) <- c("R", "Python", "Java", "C_plus_plus", "Ruby", "Perl", "Matlab", "Javascript", "Scala", "Excel", "Tableau", "D3.js", "SAS", "SPSS", "D3", "Hadoop", "MapReduce", "Spark", "Pig", "Hive", "Shark", "Oozie", "Zookeeper", "Flume", "Mahout", "SQL", "NoSQL", "HBase", "Cassandra", "MongoDB", "MachineLearning", "BigData", "PhD", "Bachelors", "Masters")
stack_count <- cbind(stackoverflow, stack_count)
# sum_count: columns term, count (total number of job ads containing the
# term) and freq (% of job ads containing the term)
sum_count <- stack_count %>%
  summarize(R = sum(R), Python = sum(Python), Java = sum(Java),
            `C++` = sum(C_plus_plus), Ruby = sum(Ruby), Perl = sum(Perl),
            Matlab = sum(Matlab), Javascript = sum(Javascript), Scala = sum(Scala),
            Excel = sum(Excel), Tableau = sum(Tableau), D3.js = sum(D3.js),
            SAS = sum(SAS), SPSS = sum(SPSS), D3 = sum(D3), Hadoop = sum(Hadoop),
            MapReduce = sum(MapReduce), Spark = sum(Spark), Pig = sum(Pig),
            Hive = sum(Hive), Shark = sum(Shark), Oozie = sum(Oozie),
            Zookeeper = sum(Zookeeper), Flume = sum(Flume), Mahout = sum(Mahout),
            SQL = sum(SQL), NoSQL = sum(NoSQL), HBase = sum(HBase),
            Cassandra = sum(Cassandra), MongoDB = sum(MongoDB),
            MachineLearning = sum(MachineLearning), BigData = sum(BigData),
            PhD = sum(PhD), Bachelors = sum(Bachelors), Masters = sum(Masters)) %>%
  tbl_df() %>%
  gather(term, count, R:Masters) # count = number of job ads that include the `term`
sum_count <- sum_count %>% mutate(freq = count / 362)
# freq = proportion of job ads that contain the `term`
# (362 = total number of Stack Overflow job ads)
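Deriving the denominator from the data instead of hard-coding 362 would make the step more robust. A minimal base-R sketch, where the tiny `stack_count` is a made-up stand-in for the real indicator data frame:

```r
# Toy stand-in for the real stack_count indicator data frame (1 = term present)
stack_count <- data.frame(R = c(1, 0, 1), Python = c(1, 1, 0), SQL = c(0, 0, 1))

counts <- colSums(stack_count)
sum_count <- data.frame(term  = names(counts),
                        count = as.numeric(counts),
                        freq  = as.numeric(counts) / nrow(stack_count))
sum_count[order(-sum_count$freq), ]  # most frequent terms first
```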
sum_count %>%
filter(term != "Masters" & term!="PhD" & term!="Bachelors") %>%
arrange(desc(freq))
## Source: local data frame [32 x 3]
##
## term count freq
## (chr) (dbl) (dbl)
## 1 MachineLearning 159 0.43922652
## 2 BigData 88 0.24309392
## 3 R 75 0.20718232
## 4 Python 65 0.17955801
## 5 Scala 53 0.14640884
## 6 Hadoop 50 0.13812155
## 7 SQL 44 0.12154696
## 8 Excel 42 0.11602210
## 9 Java 35 0.09668508
## 10 Spark 31 0.08563536
## .. ... ... ...
sum_count %>%
filter(term == "Masters" | term =="PhD" | term =="Bachelors") %>%
arrange(desc(freq))
## Source: local data frame [3 x 3]
##
## term count freq
## (chr) (dbl) (dbl)
## 1 PhD 32 0.08839779
## 2 Masters 29 0.08011050
## 3 Bachelors 19 0.05248619
sum_count %>%
filter(term != "Masters" & term!= "PhD" & term!="Bachelors") %>%
ggplot(aes(term, freq)) +
geom_point(size=2) +
geom_line(aes(group = 1), color = "maroon") + # set the colour, don't map it inside aes()
xlab("") +
ylab("Percentage of appearance in job postings") +
ggtitle("Percentage of job ads with a data science tool")+
theme(text = element_text(size = 20),
      axis.text.x = element_text(angle = 90, vjust = 1),
      plot.title = element_text(size = rel(1.1)),
      legend.position = "none")
For the Stack Overflow job data, R still leads Python, by about 3 percentage points. Hadoop and SQL were roughly equally popular for data scientists. Notably, about 44% of Stack Overflow job ads for data scientists mentioned machine learning! It seems it is becoming important for data scientists to be familiar with machine learning principles and algorithms. Surprisingly, only 24% mentioned big data.
pal2 <- brewer.pal(8,"Dark2")
wordcloud(sum_count$term, sum_count$freq, scale = c(4, 1.5), random.order = FALSE, colors=pal2, rot.per = 0, fixed.asp = FALSE)
# Strip the HTML markup and entities left over from scraping, applying the
# substitutions in the same order as before. ("&amp;" and "&rsquo;"-style
# entities were rendered as bare characters in the published post; the
# entities are restored here.)
html_patterns <- c("\\[|\\]", "rel=\"nofollow\"", "<li>experience", "<li>work",
                   "<li><span>", "<p>", "</p>", "<p><strong>", "<ul>", "</ul>",
                   "<li>", "</li>", "&amp;", "&nbsp;", "</strong>", "<strong>",
                   "<span>", "<br>", "<em>", "</span>", "<a href",
                   "=\"http://www", "&rsquo")
Description1 <- Description
for (p in html_patterns) {
  Description1 <- Description1 %>% mutate(Description = gsub(p, "", Description))
}
r_words <- c("right", "youll", "get", "next", "high", "bull", "way", "etc",
             "didate", "based", "every", "take", "ndash", "real", "core", "key",
             "day", "able", "van", "time", "used", "highly", "years", "sets",
             "want", "seeking", "senior", "related", "join", "ability", "new",
             "teams", "role", "use", "like", "make", "across", "provide",
             "drive", "help", "work", "working", "including", "questions",
             "environment", "implement", "developing", "products",
             "responsible", "need", "closely", "recommendations", "just", "ing",
             "set", "andor", "company", "apply", "part", "analysing", "analyse",
             "using", "can", "within", "will", "youll", "well", "looking",
             "position", "one", "also")
review_text <- paste(Description1$Description, collapse=" ")
review_source <- VectorSource(review_text)
corpus <- Corpus(review_source)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removeWords, r_words)
dtm <- DocumentTermMatrix(corpus)
dtm2 <- as.matrix(dtm)
frequency <- colSums(dtm2)
frequency <- sort(frequency, decreasing=TRUE)
head(frequency)
## data team experience business scientist learning
## 2249 564 382 378 344 326
words <- names(frequency)
wordcloud(words, frequency, scale=c(3.5,.7), min.freq=30, random.order=FALSE, colors=brewer.pal(8, "Dark2"), max.words = 250, rot.per = 0, fixed.asp = FALSE, random.color = TRUE)
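The tm pipeline above amounts to: lower-case, tokenise, drop stop words, count, sort. The counting core can be sketched in a few lines of base R (the two-ad vector is a made-up example, not the real data):

```r
ads <- c("Data team seeks data scientist with machine learning experience",
         "Machine learning engineer to join the data team")

# lower-case, split on runs of non-letters, drop empty tokens and a few stop words
tokens <- unlist(strsplit(tolower(ads), "[^a-z]+"))
tokens <- tokens[nchar(tokens) > 0 & !tokens %in% c("with", "to", "the", "a")]

frequency <- sort(table(tokens), decreasing = TRUE)
head(frequency, 3)  # "data" tops this toy corpus with 3 occurrences
```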
So here are my takeaways from this analysis!
You can pick which programming languages to learn depending on which career in the “data science universe” you want to pursue.
Although Python seems to be gaining momentum over R, they have different strengths: R was developed for statisticians (it is easier to apply statistical concepts without much programming experience), while Python is a general-purpose language. So you may want to pick up both, for different purposes.
Companies seem to prefer open-source languages such as R and Python over commercial ones such as Matlab and SAS. Very good news!
It seems that machine learning will become more and more important for data scientists to be familiar with.
Currently, east coast cities have more data science jobs than west coast cities. So hooray for the east coasters!
However, Silicon Valley has more jobs for data scientists than for data analysts, and more for data engineers than for data architects. In contrast, east coast cities have more jobs for data analysts than for data scientists. They also have more jobs for data engineers than for data architects, though the gap is not as large as in Silicon Valley.